feat(testing-framework): POC — Gherkin + JS dual front-end flow-IR with reusable prompt flows #2639
Closed
ScriptedAlchemy wants to merge 9 commits into
added 7 commits
June 9, 2026 23:10
Shared flow IR (variable table, named flows with scoped args/returns, keyword-to-node policy) with three authoring surfaces: .feature files via @cucumber/gherkin, a fluent typed JS API, and bindFeature sparse overlays with drift validation. Includes offline demo and unit tests with fake agents.
Offline-by-default demo narrating the login/checkout journey through pure Gherkin, pure JS, and bound overlay modes with scripted fake agents, proving identical traces across front-ends and diffing overlay changes. Experimental --live mode runs against the static demo shop.
…wip) In-progress increment: codex-backed general agent for the POC demo's live mode; validation and real-run verification still pending.
Completes the codex live-mode increment: lazy-load @midscene/core/ai-model in CodexGeneralAgent (keeps the package index importable under vitest), fail fast when a capture extracts an empty value, add the missing back-to-shop step the real journey exposed, note live-mode verdict nondeterminism in the trace comparison, and document the codex setup (codex login, auto-configured MIDSCENE_MODEL_* env) in POC-GHERKIN.md. Verified live against codex gpt-5.5: all three modes pass.
Share engine step bookkeeping and getReportFile between runCase and the IR executor, merge prompt/capture step scaffolding, dedupe var-record stringification and identifier regexes, drop dead API (PromptStepIR.role, unused executor options, FlowRegistry.names), clean up codex screenshot temp files per call, fix nested-JSON verdict parsing with a regression test, and memoize the demo's codex CLI probe.
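The nested-JSON verdict fix itself is not shown in this text. As a rough sketch of the idea only, assuming the agent reply embeds a JSON object with a hypothetical `{ pass, reason }` shape, a fail-closed parser could scan for balanced top-level objects rather than matching up to the first closing brace. `Verdict`, `balancedObjects`, and `parseVerdict` are illustrative names, not the package's API.

```typescript
// Hypothetical verdict shape; the real adapter's schema may differ.
interface Verdict {
  pass: boolean;
  reason: string;
}

// Collect every top-level balanced {...} span, tracking JSON strings so that
// braces inside string values or nested objects do not end a span early.
function balancedObjects(text: string): string[] {
  const spans: string[] = [];
  let depth = 0;
  let start = -1;
  let inString = false;
  let escaped = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (inString) {
      if (escaped) escaped = false;
      else if (ch === '\\') escaped = true;
      else if (ch === '"') inString = false;
      continue;
    }
    if (ch === '"' && depth > 0) {
      inString = true;
    } else if (ch === '{') {
      if (depth === 0) start = i;
      depth++;
    } else if (ch === '}' && depth > 0) {
      depth--;
      if (depth === 0) spans.push(text.slice(start, i + 1));
    }
  }
  return spans;
}

function parseVerdict(reply: string): Verdict {
  for (const candidate of balancedObjects(reply)) {
    try {
      const value: unknown = JSON.parse(candidate);
      if (typeof value === 'object' && value !== null) {
        const obj = value as Record<string, unknown>;
        if (typeof obj.pass === 'boolean') {
          const reason = typeof obj.reason === 'string' ? obj.reason : '';
          return { pass: obj.pass, reason };
        }
      }
    } catch {
      // Not valid JSON; fall through to the next candidate.
    }
  }
  // Fail closed: anything unparseable counts as a failed assertion.
  return { pass: false, reason: 'no parseable verdict in reply' };
}
```

The key design choice is the final return: a reply with no well-formed verdict is treated as a failure, never as a pass.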
- implement memo: 'once-per-run' flow memoization with a shareable
memoStore on RunScenarioOptions; only fully successful completions are
cached, hits replay returns with a narrated info step
- make verdict-channel instructions adapter-supplied (verdictInstructions
on GeneralAgentAdapter) so Pi keeps report_verdict wording while codex
prompts demand its JSON reply channel; adapter-neutral fail-closed reason
- bindFeature now throws on duplicate anchors targeting the same step
instead of silently merging overlays
- introduce structural UiAgentLike and use it across the engine/executor,
removing the `as unknown as Agent` casts from fakes and demo agents
- write codex screenshot temp files under midscene_run/tmp
(getMidsceneRunSubDir) instead of mkdtemp, keeping per-call deletion
- make feature()/FeatureIR symmetric with the Gherkin CompiledFeature
({ name, scenarios, flows }); CompiledFeature is now an alias
- update POC-GHERKIN.md to match
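The once-per-run memoization rule in the first bullet above (cache only fully successful completions, replay returns on a hit with a narrated info step) can be sketched roughly as follows. This is a synchronous simplification with hypothetical names (`FlowResult`, `MemoStore`, `runFlowOnce`); keying on the flow name plus its arguments is an assumption, and the real executor is presumably asynchronous.

```typescript
// Hypothetical shapes for illustration; not the package's actual types.
type FlowResult = { ok: boolean; returns: Record<string, string> };
type MemoStore = Map<string, Record<string, string>>;

function runFlowOnce(
  store: MemoStore,
  name: string,
  args: Record<string, string>,
  execute: () => FlowResult,
  narrate: (message: string) => void,
): FlowResult {
  // Assumption: a memo entry is identified by flow name plus sorted arguments.
  const sortedArgs = Object.entries(args).sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  const key = `${name}:${JSON.stringify(sortedArgs)}`;

  const cached = store.get(key);
  if (cached) {
    // Hit: skip execution, narrate it, and replay the stored returns.
    narrate(`memo hit for flow "${name}", replaying returns`);
    return { ok: true, returns: { ...cached } };
  }

  const result = execute();
  // Only fully successful completions are cached; failures run again next time.
  if (result.ok) store.set(key, { ...result.returns });
  return result;
}
```

Passing the same `store` to several scenario runs is what would make the memo shareable across them; a fresh `Map` per run keeps it isolated.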
Deploying midscene with
| Latest commit: | 4b1a261 |
| Status: | ✅ Deploy successful! |
| Preview URL: | https://e535d003.midscene.pages.dev |
| Branch Preview URL: | https://poc-gherkin-ai-flows.midscene.pages.dev |
added 2 commits
June 10, 2026 03:54
…with three style folders
- example/ now shows one realistic suite authored three interchangeable ways: style-1-gherkin (shared flows/*.feature + independent feature modules), style-2-js (shared defineFlow module + per-module *.flows.ts), style-3-overlay (sparse bindFeature patch over style-1's checkout.feature)
- add compileSuite() to the gherkin front-end: glob a suite directory (or file list), merge all @flow definitions into one registry, fail loudly on duplicate flow names across files
- add a second shared flow ("Add product to cart") and cart-inspection scenarios; extend demo-app with a second product, quantity controls and a header cart badge; scripted agents cover the new steps
- demo runs the suite module-by-module per style, narrating each source file, and keeps the Gherkin-vs-JS trace parity proof and the overlay diff
- rich first-reader comments per style (what flows, captures and overlays are); example/README.md is the orientation point; POC-GHERKIN.md updated
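The merge step described for compileSuite() (one registry for all flow definitions, loud failure on duplicate names across files) might look roughly like the sketch below. It skips globbing and Gherkin parsing entirely and assumes flows have already been extracted per file; `FlowDef` and `mergeFlows` are hypothetical names, not the front-end's real signatures.

```typescript
// Hypothetical, already-parsed flow definition; the real IR carries more detail.
interface FlowDef {
  name: string;
  sourceFile: string;
  steps: string[];
}

// Merge per-file flow lists into one registry, throwing on the first name
// that is defined twice so the clash cannot be resolved silently.
function mergeFlows(perFile: FlowDef[][]): Map<string, FlowDef> {
  const registry = new Map<string, FlowDef>();
  for (const flows of perFile) {
    for (const flow of flows) {
      const existing = registry.get(flow.name);
      if (existing) {
        throw new Error(
          `duplicate flow "${flow.name}" in ${existing.sourceFile} and ${flow.sourceFile}`,
        );
      }
      registry.set(flow.name, flow);
    }
  }
  return registry;
}
```

Naming both source files in the error is a small design choice that makes the failure actionable in a multi-file suite.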
…escape hatch
Pure .feature files are fully sufficient on their own; style 2 is for
engineering-owned dynamic suites, and style 3 (bindFeature overlay) only
earns its keep for bind-time computed values, per-environment tweaks
without forking the feature file, and a drift-validated seam between prose
and JS. Adds a blunt "Which style do I need?" decision section to
example/README.md ("you probably only need style 1"), reframes the
read-this-first comments in styles 1 and 3, and aligns POC-GHERKIN.md's
mode-selection table. Docs/comments only — no behavior changes.
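To make the "drift-validated seam" concrete, here is a minimal sketch of bind-time checks for a sparse overlay. It assumes, purely for illustration, that anchors match step text exactly; `OverlayEntry` and `bindOverlay` are hypothetical names and do not reflect bindFeature's actual signature.

```typescript
// Hypothetical overlay entry: replace the prompt of the step whose text equals `anchor`.
interface OverlayEntry {
  anchor: string;
  prompt: string;
}

function bindOverlay(steps: string[], overlay: OverlayEntry[]): string[] {
  const patched = [...steps];
  const claimed = new Set<number>();
  for (const entry of overlay) {
    const index = steps.indexOf(entry.anchor);
    // Drift: the feature text changed and the anchor no longer matches any step.
    if (index === -1) {
      throw new Error(`overlay drift: no step matches anchor "${entry.anchor}"`);
    }
    // Two overlay entries for one step are rejected rather than silently merged.
    if (claimed.has(index)) {
      throw new Error(`duplicate overlay for step "${entry.anchor}"`);
    }
    claimed.add(index);
    patched[index] = entry.prompt;
  }
  return patched;
}
```

Steps without an overlay entry pass through unchanged, which is what keeps the overlay sparse.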
Collaborator
Author
|
Closing in favor of a fresh implementation: the POC validated the concept (AI-executed Gherkin, reusable prompt flows, three routing modes), and the follow-up design session concluded we should build it as a new standalone package — |
What is this?
A POC that lets you write UI tests as plain-English Gherkin (`Given / When / Then`) with no step-definition code at all: the AI executes each step directly. It builds on the v2 testing-framework from #2589.
Classic Cucumber needs a JS function registered for every step. Here, the steps are the prompts:
- `When` steps → the Midscene UI agent acts on the page (`aiAct`)
- `Then` steps → a general agent judges a fail-closed verdict from the screenshot
- `I remember ... as "price"` → extracts a value into a variable table; `{price}` is substituted into later prompts mechanically, so the model never sees a placeholder
- `I run the "Login" flow` → calls a reusable, parameterized prompt flow (own scope, declared args/returns, optional once-per-run memoization), which is the answer to chaining prompts across complex multi-step journeys
Three interchangeable authoring styles, one engine
All three compile to the same internal flow-IR and run identically (the demo proves trace-for-trace parity). Decision rule: you probably only need style 1. Plain `.feature` files run end-to-end with nothing else; styles 2 and 3 are an alternative and an optional escape hatch, never requirements.
- Style 1 (Gherkin): `.feature` files only, end to end
- Style 2 (JS): `defineFlow()` / `scenario()` builders
- Style 3 (overlay): `.feature` stays the source of truth; a sparse JS overlay patches anchored steps, with drift caught at bind time so the prose↔JS seam can't silently rot
The `example/` directory has one folder per style, each laid out as a realistic multi-file suite (shared flows reused by separate cart/checkout test modules); see `example/README.md`.
Try it (no API key needed)
The offline demo narrates every resolved prompt, variable capture, flow call, and verdict, then diffs the three styles against each other.
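The mechanical `{price}` substitution described under "What is this?" can be illustrated with a small sketch. It assumes variables live in a plain `Map` and that an unknown or empty variable should abort the step; `resolvePrompt` is a hypothetical helper, not the engine's actual function.

```typescript
// Replace every {name} placeholder with its captured value before the prompt
// is sent to the model, so the model only ever sees resolved text.
function resolvePrompt(template: string, vars: Map<string, string>): string {
  return template.replace(/\{([A-Za-z_][A-Za-z0-9_]*)\}/g, (_match, name: string) => {
    const value = vars.get(name);
    // Assumption: a missing or empty capture is an error, not an empty string.
    if (value === undefined || value === '') {
      throw new Error(`variable {${name}} has no captured value`);
    }
    return value;
  });
}

// A capture step would write into the table; later steps read from it.
const captured = new Map<string, string>([['price', '$19.99']]);
const resolved = resolvePrompt('the order total equals {price}', captured);
console.log(resolved);
```

Because the replacement is a function, characters such as `$` in a captured value are inserted literally rather than interpreted as replacement patterns.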
Status / validation
- … `codex://app-server` provider …
- … `packages/testing-framework/POC-GHERKIN.md`
Feedback wanted
- … `I remember ... as "var"` and `I run the "X" flow with ...`